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I. INTRODUCTION 



The subject of data base organization is receiving increased 
attention in the computer community. This is the case because 
data bases are used in almost all types of computer applica- 
tions, e.g., business and scientific applications. The com- 
puter community's involvement in data base studies has been 
limited to specific applications. Recently, it was realized 
that application independent data base systems are important. 
Generalized data base system is the name used for this type 
of system. 

Various groups are making efforts to organize the study 
of data base system. Foremost in the field is the CODASYL^ 
Systems Committee [3, 4], This committee has as its goal the 
development of specifications for a common language and func- 
tions. Concurrent with the initial publication of the CODASYL 
Systems Committee’s report, arguments were raised as to the 
advantages and disadvantages of generalized data base systems 
as- compared with the traditional approach of application 
dependent data base systems. To fully understand the differ- 
ent arguments in favor of or against an approach, one needs 
to delve deeper into the subject of data systems. This paper 
is an attempt to put into perspective the multiple and 
diversified data base system developments. 

^CODASYL is the abbreviation of Conference on Data Systems 
Languages . 
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II. CLASSIFICATIONS OF DATA BASF: MANAGEMENT SYSTEMS 



Data base management systems have various characteristics. 
The CODASYL Systems Committee in its technical report dated 
May 1969 gave the following major data base characteristics: 
data structure class, storage structure class, generalized 
processes provided, language type, language form, modes of 
use, file media, hardware environment and operating systems 
environment. Almost all of these major characteristics have 
subclassifications, e.g., under generalized processes pro- 
vided there are file definition, file creation, file updating 
and interrogation. 

The ensuing discussion will deal with the more important 
areas of data organization characteristics. The system eval- 
uation process will also be discussed as much as possible. 

A number of these characteristics are dependent. An example 
is the file media option. One would keep in mind physical 
storage characteristics when selecting a data organization. 
Also, the hardware and operating systems environment must be 
considered. Sometimes these facilities are fixed beforehand. 

In other cases, the computer system is selected in addition 
to the data base system. 

A. DATA STRUCTURES 

Data structure is the logical or hierarchical relationships 
among elements in the data base. In contrast, storage struc- 
ture is the actual organization of data elements on the 
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physical storage system. And since data relationships usually 
exist among records, data structure can further be viewed as 
the relationships among records. Based on this viewpoint, 
data structures are classified according to how the records 
are cross-referenced. There are three general classifications 
of data structures: (1) Multiple record file: In this clas- 

sification no record "owns" any record. Records are related 
to each other only in their physical positions; hence, their 
relationships can be defined only by logical keys. For 
example, for records stored sequentially, a record's key may 
be smaller or greater than another record's key, depending 
on its position relative to the other record. (2) Hierar- 
chical file: This type of data structure is sometimes called 

a tree structure because of the schematic representation of 
its data relationships. In this classification a record can 
"own" any number of records but it can be "owned" only by one 
record. For an example, in Figure 1, record D "owns" records 
I and J and is "owned" by record A only. (3) Network struc- 
ture: In this classification a record can "own” or be "owned" 

by any number of records. For an example, in Figure 2, record 
D "owns" records J and J and is "owned" by both records A and 
C. [London, 1973] . 

Due to the advent of direct access storage technology, 
multiple record files are not as numerous as they once were. 
Thus, to a great extent, tree and network data structures are 
used in relating data within a file or from one file to 
another. Further, these data structures are found to be more 
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HIERARCHICAL OR THREE DATA STRUCTURE 
Figure 1 
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NETWORK DATA STRUCTURE 
Figure 2 
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efficient for many data base applications. Chains and chain- 
modifier ring structure and list structure are two main methods 
used in cross-referencing data in tree and network data 
structure . 

1 . Chain Data Structures 

Chaining is a technique of linking logically-related 
records by means of pointers; pointers are special fields 
incorporated into each record structure which contain refer- 
ence(s) to the next log ically- related record(s) . A group of 
records so structured and linked is called a chain. 

Chaining has a number of advantages [London, 1973] . 

Data in chains can be accessed by several keys; the number 
of keys depends on the number of independent chains a record 
belongs to. The use of multiple record keys allows accesses 
to records. It also allows records to be related in a number 
of ways. Further, it allows data retrieval in a logical 
sequence regardless of the physical sequence of records. 

Another advantage of chaining comes to light when additional 
reports are required after the system had been installed. 
Chaining allows users to produce newly required reports with- 
out necessarily restructuring the file. 

As to the disadvantages [London, 1973] , the most crit- 
ical is the cumbersome updating procedure which requires 
significant processing time. This disadvantage, however, 
could be decreased to a reasonable level by planning the 
system updating. Two main factors have to be watched in 
regard to system updating: the frequency of additions and 
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deletions of records in the chains and the frequency of data 
changes within the established chains. (The former is usually 
referred to as the volatility of data.) Other disadvantages 
of chaining are record growth and the length of chains. 

Search time to retrieve items from the chains is directly 
proportional to the size of the records and the length of the 
chains. Thus, although this disadvantage is less critical 
than the former, it is by no means unimportant. 

2 . List Data Structures 

A list data structure is comparable to chaining and 
is conceptually an extension of chaining. However, instead 
of using pointers to link records, the list data structure 
technique uses list, such that each list of addresses (or 
records) contains logically related items only. Lists are 
used for the primary purpose of reducing the length of search 
time (the second disadvantage of chaining) . A type of list 
is the simple list. 

A simple list approach is used in conjunction with 
chains of records. If chain length is to be restricted or 
if- the search time is to be reduced, the chains are divided 
into a number of segments, each of which owns an entry point. 
The table of these entry points is the list . The entry points 
are used as the primary keys in searching the file. Actual 
addresses can be used as entry points. Variations are pos- 
sible in building lists. The primary keys may be made to 
hold more than an entry point to a chain if the chain is long. 
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On the other hand, the number of primary keys can be reduced 
if the chains are relatively short. 

Figures 3, 4, 5 and 6 are diagrams of a simple list 
and its variations. It is assumed that a storage drum would 
be used. The categorization of records per specific char- 
acteristic is termed an attribute. Further, in these figures 
it is assumed that the primary index would be composed of 
insurance policy numbers of car owners. Chains A, B, C, D, 
etc. are used to designate eacli thousand series of policy 
numbers . 



In Figure 3, the number of records is on the average 
just enough to occupy one track of the storage drum. This 
facilitates the storage scheme of chains A, B, C, D, etc. 

Each chain could be stored on one track each. 

If the average number of records related to each attri- 
bute is large, it would be necessary to store the records on 
more than one track. In this situation each primary index 
entry can be made to point to more than one track. This is 
what is done in Figure 4. The index entry of 1000 is used 
to point to track #1 and track #2. Also, the primary index 
entry of 2000 is used to point to track tf 3 and track ft 4. Note 
that pointers are needed from the last entries of track SI and 
track it 3 to the first entries of track #2 and track #4, re- 
spectively . 

In Figure 5 is another list variation. The number of 
records per attribute is small and can be contained on some 
part of eacli track. Records related to other attribute (s) 
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Pointers to 
policy numbers 




Attributes: 1000 series, 2000 series 

Primary keys: 1000, 2000, 3000, 4000, 
Entry points: 1000, 2001, 3110, 4000, 
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SIMPLE LIST DATA STRUCTURE 
Figure 3 
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Pointers to expiration dates 




Primary index (main attribute) : 1000, 2000, 3000, -- 
Secondary index (second attribute): January, February, March, 



LIST DATA STRUCTURE KITH A SECOND DEGREE OF INVERSION 

Figure 6 
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could still be stored in the unused space of the tracks to 
fully utilize them. Hence in Figure 5, the pointers could 
be made to point to every other attribute (1000, 3000, 5000, 
7000) . 

Assume that we want another list (secondary index) 
that would be composed of insurance expiration dates. Since 
the records are made contiguous on the basis of policy numbers, 
we need pointers to follow the sequence of insurance expiration 
dates. This is shown in Figure 6. Since there would be two 
lists, the primary and the secondary indices, the scheme in 
Figure 6 is a second degree inversion. 

Simple lists are extensions of chaining, hence they 
have the same disadvantages, except that the extended search 
time in chains is eliminated. The search capability of lists 
is of considerable use. Much of the searchings is done in 
the list, thus unnecessary retrieval of records is avoided. 

The degree of search effectiveness of a list is directly pro- 
portional to the degree of file inversion. Inversion of files 
or the inverted list will be discussed later. However, the 
simple list requires more storage space than chaining. De- 
pending upon the application, the reduced search time may well 
be worth the additional space. 

The extreme of list data structuring is the inverted 
list. This approach requires one entry for each type of attri- 
bute in the file; hence, the lists would hold the references 
between records. Conceptually, the inverted list is a series 
of lists of pointers to data records. The lists can be held 
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at the beginning of the file or at another location. The 
inverted list's record insertion and deletion procedure is 
the same as that of a chain; i.e., record insertion and de- 
letion is done in the file area. However, the difference is 
that all changes to the status and linkages between records 
are done in the independent list area. The inverted list 
provides a very flexible response to user retrieval requests. 
This type of list provides for data retrieval on the basis of 
variable parameters. To illustrate this capability, assume 
that conditions A, B, C and D are the variable parameters to 
be used as the basis for record retrieval; i.e., only records 
that meet these conditions are to be retrieved. The file to 
be maintained for this example would include lists maintained 
for each condition. Upon request for the record(s) that meet(s) 
the given conditions, the different condition lists would be 
searched and its intersection would be the required record (s). 
Processing time to meet this kind of requirement would be much 
reduced if the condition that has the smallest number of 
records is searched initially, then the chosen records could 
be searched in the next smaller list and so on. This technique 
is often referred to as the least list principle. Another 
technique for reducing processing time is to numerically order 
the records for each condition and then apply the appropriate 
search algorithms. 

The use of inverted lists has two principal advantages. 
The primary advantage of this approach is that complicated 
retrieval requests can be processed efficiently. Another 
advantage is that processing time for updating is much reduced 
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when compared to chaining, since most of the updating is done 
in the lists rather than in the file area. With this type of 
approach, no pointer is incorporated with the data in the 
record structure; thus, the reduction of record size could 
somewhat balance the storage space occupied by the lists. 

Based on the aforementioned arguments for chaining 
and lists, the choice of an inverted list is warranted if 
selective data retrieval is frequent. This question is some- 
thing the evaluator (with the guidance of the users) has to 
decide. A statistical analysis should be made of the frequency 
of retrieval of data elements prior to the selection of a data 
structure . 

To go a step farther, one may take into consideration 
the possibility of using an involute type of data structure. 

An involute structure is a logical extension of the inverted 
list. It is a structure in which the data elements replace 
the pointers to records from the lists. The physical records 
are replaced by linkages of chains, each element of which 
belongs to a list. Each list would contain all the data 
elements belonging to it, hence elements could appear in any 
number of lists (up to its maximum) . This structuring provides 
complete flexibility in file accessing and producing reports. 
However, the main arguments against this type of structuring 
are its increased processing time and its increased storage 
space requirement to maintain the lists. 
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B. STORAGE STRUCTURES AND FILE MEDIA TYPES 



As previously mentioned, storage structure is the actual 
physical data organization in the system. Generally there 
are two possible ways of storing the data: sequential or 

non-sequential (random). The method of data storage is in- 
variably dictated by the type of access that may suitably be 
adapted after selecting the data structure design. Three 
types of access may be sequential, indexed-sequential and 
direct. Sequential access does not lend itself to random 
storage of data, since the retrieval of records is based on 
some pre-designed logical sequence. For indexed-sequential 
access, sequential storage of data is applicable. The index 
and actual records are sequentially stored. Direct access of 
data for retrieval is accomplished by using an algorithm which 
converts between the record key and physical address. 

The following is a tabular summary of storing and accessing 
of records: 

TYPES OF ACCESS : TYPE OF STORAGE 





SEQUENTIAL : 


RANDOM 


Sequential 


: Yes : 


No 


Indexed-sequential 


: Yes : 


No 


Direct 


: Yes 


Yes 



The selection of file media is dependent on the storage 
structure. If random storage is desired, use of magnetic 
tapes is not feasible. Tapes could be considered as file 
media only if (1) the data are to be stored and accessed in 
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sequence, (2) the file is small and/or (3) the frequency of 
access is small. 

The storage structure to be adapted should be a result of 
the users' requirements (speed, space, etc.) and cost analyses. 
Table I provides a comparative cost of a sample of storage 
units. It should be noted that the processor storage has the 
highest cost per byte and the magnetic tape and data cell have 
the lowest cost. Usually, the faster the storage unit, the 
higher the cost. For each type of storage unit the cost 
difference is usually directly proportional to the capacity. 
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COMPARATIVE COSTS OF STORAGE UNITS* 
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Above IBM storage equipments are available at W. R. Church Computer Center, NPS. 
Costs are based on the IBM price list for fiscal year 1975. 



III. ANALYSIS OF DATA STRUCTURING TECHNIQUES 



Data organization is a term applied to the various methods 
of organizing data within a file. While Dodd's term is data 
organization, Lefkovitz calls it file organization. [House, 1974 
and Lefkovitz, 1969]. There arc various schemes of organizing 
data within a file. Dodd categorized all these schemes into 
three basic types of organization: sequential, random, and 

list . 

From the preceding section four distinct types of data 
structures can be related to these data organizations' basic 
classifications offered by Dodd. They are: multiple record 

file, chaining, simple list and inverted list. Sequential data 
organization, as Dodd defined it, is associated with a multiple 
record file. Chaining, simple list and inverted list would be 
included in Dodd's list classification of data organization. 
Three data structuring methods are analyzed in this section: 
multiple record file, simple list and inverted list. They are 
used to describe the different data organization techniques. 

It is evident from Section II that data structure is a factor 
of prime importance in the evaluation of the data base manage- 
ment systems. 

In evaluating data base systems, various criteria have to 
be considered. Different authors give different sets of cri- 
teria. The following set of criteria is deemed comprehensive 
and the most important one: retrieval capability, maintenance 
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capability, storage requirements and accuracy. The analysis 
of the three data structure methods is based on these criteria. 

A. RETRIEVAL CAPABILITY 

Figure 7 is a representation of the time involved with data 
retrieval. Time delays can occur between the central processing 
unit and the file storage, depending on the data structure type, 
available memory and index size. This will be explained later. 
Other factors contributing to the time duration between query 
and response from the system are not considered because they 
are not relevant to the analysis of the data structures. One 
of these contributory factors is "handshaking," a term used to 
designate the process of synchronization between the terminal 
and the computer when a query is logged in. 

In the analysis we are involved only with the CPU proces- 
sing time and storage units time. These are the time elements 

that vary with the data base configuration. The parameters in 

2 

the time formulation are related to these factors. Table II 
is a listing of these parameters, definitions and symbols. 

The file- and query- related parameters are dependent on the data 
structure. The device-related parameters are dependent on the 
file media characteristics. 

The file-related parameters are: the total number of 

distinct keys, total number of records in the system, average 
number of keys per record, average length of lists and average 
2 

The parameters and time formulation to follow are mainly 
by Lefkovitz. In some instances, a different terminology is 
used . 
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TERMINAL 




T = retrieval time 
(portion of it) 



PORTION OF DATA RETRIEVAL TIME TRACE 
Figure 7 
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TABLE II 



SYMBOL DEFINITIONS OF FILE PROCESSING PARAMETERS 



Symbol : Definition : Parameter Type 



V 


Number of distinct keys in 
vocabulary 


File related 


N r 


Number of records in system 




N k 


Average number of keys per record 




L 


N r Nk 

Average list length = — y — 




N t 


Average number of keys in a single 
query product 


Query related 


N 

P 


Average number of non-negated 
terms in a single query product 




L s 


Average shortest list length in 
query 




P 


Ratio of query response to L s 




A 


Number of file record addresses 
per media physical record 


Device related 


T r 


Random access time of media 




R 


Rotation time of media (direct 
access only) 
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number of characters per record. The keys are the pointers 
listed in the index or indices for the list-type of data 
structure. The average of the number of keys per second is 
computed over all the pointers to each record. 

The query- related parameters are: average number of keys 

in a single query product, average number of non-negated keys 
in a single query product, average shortest list length per 
query and the average ratio of query response to the average 
shortest list length per query. The term "query product" as 
used here refers to the record retrieved as the result of the 
query. The non-negated keys are the keys not eliminated from 
consideration in the process of searching record pointers 
within the index. 

The file media-related parameters are: number of file 

record addresses per file media record, random access time of 
file media, transfer rate of file media, and for direct access 
storage devices (DASD) , rotation time. 

Excluding query interpretation time and the index directory 
decoding time, the list search retrieval times are: (1) for a 

simple list: 

Time = List search and record transfer time 
and (2) for an inverted list: 

Time - List intersection time + List search and 
record transfer time. 

Using the symbols, the formulae are: (1) for a simple list: 
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3 



T SL = M T r + 
and (2) for an inverted list: 

t IL = £ (N t )(T r + 1 ‘ 5R) + (T r + 1-5R) (L s ) (p). 

1 . CPU Processing Time 

The CPU processing time varies with the data structure 
type. For a multiple record filing scheme, the CPU processing 
time of a query may not be significant. Upon receipt of the 
query, the CPU would determine which file is to be processed 
and the storage unit would be accessed and reaccessed until the 
desired record is found. For the list-type of data structure, 
the processing of the query would be more involved. 

Many factors affect this interpretation process of the 
query by the CPU. One of prime importance is the index and 
file partitioning technique. When partitioning the indices of 
a complicated list data structure, the frequently accessed 
indices should be stored in fast DASD. In this way, the sum- 
mation of indices access time is reduced. The same consider- 
ation applies to the partitioning of the records of the file 
among the file media available. 

Another factor is the index size. The index must be 
accessed and transferred to the main memory if it is not in 
main memory. If the index is large and cannot be contained 
in the available portion of the main memory, a succession of 

3 

The multiplier 1.5 of R is arrived on the assumption that 
a physical record occupies one track. Otherwise this multiplier 
would be the sum of the average track rotational delay (.5) and 
the fraction of the track occupied by a physical record. 
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accesses and searches of the index may have to be done before 
the right pointer to the record is determined. For a simple 
list and an inverted list of the same number of records, the 
inverted list is likely to have a larger index size because 
the number of pointers to each record is usually more than one. 
For a simple list, there is only one pointer for each record. 

If the file involves more than one degree of inversion, 
the sequence of index access and search is also important. 
Usually, the shortest lists should be accessed and searched 
first in order to minimize the total length of the index to 
be operated on. 

For a complicated list, the associativity of the records 
is another factor that affects the time duration of query pro- 
cessing by the CPU. If the records have minimum association, 
i.e., only one or a few keys are related to each record, then 
the fewer indices need to be intersected. This implies that 
less time would be required to retrieve a record from the data 
base . 

Record selectivity is another factor. The file is con- 
sidered to be highly selective if each key points to only one 
record in the data base. A completely inverted list, therefore, 
is of highest selectivity. In this case, while the time for 
list intersection would be longer, the search and access pro- 
cess of the record would be shorter than for the simple list. 

2 . Storage Unit Operating Time 

The type of media that is used will affect the time 
duration between query and response. Usually the indices and 
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the records of the file would be on the peripheral device(s). 
The choices are many: drums, discs, magnetic tapes, paper 

tapes, etc. 

The main factors to be considered in selecting a file 
media are the cost of the equipment and its speed. It is 
important to note that new technology is forcing reduction in 
storage cost. 

Table III is a sample of DASD timing factors. For 
typical magnetic tapes, random access time is one minute and 
serial access is about four milliseconds. See Table I for a 
sample of storage unit capacity and cost. Typical file media 
used today are: 

(1) Card . Slowest storage unit but inexpensive. 

(2) Paper Tape. Least expensive storage unit that could be 
used for direct communications to the computer. 

(3) Magnetic Tape. Serially processed; inexpensive mass 
storage media for large files; could be used for very fast 
off-line data transmission. 

(4) Small Drums. Small backing storage for high activity 
data or programs; may hold indices. 

(5) Small Disc Unit. Interchangeable file media; fast for 
serial processing; requires seek time of about 50 to 200 
milliseconds before access. 

(6) Large Disc Unit. Larger capacity; access time about 
40 to 180 milliseconds. 

(7) Large Drum. Largest fixed storage presently available, 
about a billion characters; fast access time of about 30 to 
50 milliseconds. 

(8) Magnetic Card File. Interchangeable cartridges of 
cards; access time of about 235 milliseconds. 

(9) Large Magnetic-Strip File. Large random access file; 
interchangeable bins of strips; access time of about 90 to 
400 milliseconds; can be used for fast serial processing. 
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TABLE III 



SAMPLE OF IBM DIRECT ACCESS STORAGE DEVICES 



: : Rotation: T r 

Model : DASD Type : (ms) : (ms) 



2301 


Drum 


17.5 




1301 


Movable head disk 


34 


120 


2311 


Disk pack (movable head) 


25 


75 


2321 

(Data cell) 


Magnetic strip, cell store 
(movable head) 


50 


400 
3 50* 
500** 



* Previous strip restored 

** Previous strip not restored 



(10) Large Core Memory. Most expensive; fastest access 
time; access time of about 8 microseconds. [Martin, 1967]. 

Disc units may be either removable or nonremovable packs. 
Removable packs have longer access time, lower recording den- 
sity and lower cost per byte than nonremovable packs. Further, 
the former offers unlimited off-line storage capacity while 
the latter's capacity is limited to the on-line storage. 

Partitioning of files among the file media is not the 
only technique available to the evaluator or designer of data 
base systems for maximizing the speed of file operations. 
Various techniques of reducing seek and access times of DASDs 
have been developed. The main approaches to reduce DASDs' 
operation times are the following: (1) Records that are 

usually accessed in sequence are selectively stored on the 
device. Record processing time after an access is computed. 
After accessing a record, rotation time is allowed equivalent 
to the computed processing time. In this way the next record 
can be read immediately and no unnecessary rotation of DASD 
is incurred. (2) Another algorithmic approach to reduce the 
device operation time is to time the rotation and availability 
of read heads (applicable only to multi-head DASDs). See pages 
293 to 296 of Ref. 14 for above algorithms and examples. 

3 . Data Structure Types and Data Retrieval Time 

To review the preceding discussion, we note that the 
causes of the differences in data retrieval times attributable 
to data structure are: the time spent by the CPU on processing 

of query, index and record and the storage unit operating time. 
For convenience, we would designate this portion of data 
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retrieval time as the data retrieval sub-time (T In 

v sub' 

this respect, there are three data structure types that are 
under consideration: multiple record file, simple list and 

inverted list. The following formulations are based on a single 
record retrieval. 

a. Multiple Record File and Data Retrieval Sub-time 

Under the multiple record file type of data struc- 
ture, once the query for a record is received and interpreted 
by the CPU, the search for the record will start. No index 

or indices have to be accessed and searched and intersected 
for the record address. Rather, a succession of accesses and 
CPU processing of records might have to be done before the 
desired record is found. This CPU processing is actually the 
comparison process of records. With this type of data struc- 
ture, the records should be stored sequentially, especially 
if magnetic tapes are used. If records are randomly stored, 
random accessing of records has to be done. 

When DASDs are used, other search algorithms are 
possible. One of these is the logarithmic or binary search. 

b. Simple List and Data Retrieval Sub-time 

Data retrieval sub-time for this type of data 
structure has been partially formulated in the previous dis- 
cussion. Index decoding time, however, has to be added to 
the previous formulation. Sub- time would be 



T u = T CI + N T or 

sub SL p cpu 



See Figure 7 for a complete picture of the data retrieval 



time 
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sub 



= L ( T + 1 . 5R ) 



+ 




T 

cpu 



Assignment of values for T and R are dependent on the storage 
unit average operation time (see Table III). 

c. Inverted hist and Data Retrieval Sub-time 

Sub-time formulation for this data structure type 
has also been partially done. Like the simple list, the index 
decoding time has to be added. The resultant equation is 



t = t + M T 

sub 1 1 L N t cpu 



or 



T , = -?- ( N + ) (T + 1.5R) + pL (T + 1.5R) + N. T 

sub A v t ; ^ r 1 s^r 1 t cpu 



or 



T sub ={ t( N P * pL s HT r + !- 5R) + N t T 



cpu 



Also like the simple list, assignment of values for T and R 
are dependent on the storage unit operating times; sample values 
are listed in Table III. 

Note that the T for the list structures would be 

cpu 

different from T of the multiple record file. The T , times 

cpu 1 cpu 

are dependent on the number of machine instructions necessary 
to compare records and to decode the index. 



B. MAINTENANCE CAPABILITY 

In some data base systems, more time is expended on main- 
taining files than on searching them. Because of this situation, 
file maintenance speed and efficiency determine the cost and/or 
feasibility of the entire system. It is reported that 20 to 30 
percent of total machine time is spent for the file maintenance 
function of sorting in general business applications. [Meadows, 
1967] . 
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Sorting and merging are the best known as well as the most 
common file maintenance functions. Another maintenance function 
is the initial file creation. The file maintenance functions 
can be subdivided into: (1) addition of records to a file, 

(2) deletion of records from a file, (3) changing the value 
of a record's field, (4) changing the record's structure, (5) 
changing record sequence in a file, and (6) changing the file 
storage media. The fifth function is sorting. 

This subsection will describe the capabilities of the three 
general data structures - -mult iple record file, simple list and 
inverted list, in terms of the above functions. Simple and 
inverted list will be included under lists structures. 

1 . Maintenance of a Multiple Record File 

The maintenance of a multiple record file is trivial 
unless a sequence of records other than their arrival sequence 
is desired. If records are not sequenced, the new records are 
usually added at the end of the file. In this case no complex 
search for spaces for the records to be inserted is required. 

The situation is entirely different if a record sequence is 
maintained. Here oftentimes the entire file must be copied. 

Deletion of records can be done in more than one manner 
with this data structure type. One method of deletion for this 
data structure is to flag the records to be deleted as such by 
changing a value of one of their fields. By this method the 
records to be deleted would not be physically removed from the 
file. Another method of deleting records is to copy the entire 
file, excluding the records to be deleted. The latter method 
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is necessary if space is needed for new records. The changing 
of records content or structure is done by changing, adding 
or deleting fields in memory. The fifth function (sorting) 
involves alteration of record sequence. Multiple passes through 
the field may be necessary to accomplish this. The last func- 
tion, which involves changing the files medium, is considerably 
a simple process. 

2 . Maintenance with List Structures 

Different procedures for the addition of records to a 
list data structure are diagramed in Figures 8, 9, and 10. 
Variations among the different procedures are dependent on the 
structure of the file to which the new records are to be added. 
The records may be arranged in sequence logically or physically 
within the main file. For example, in Figure 9 the new records 
are logically sequenced within the main file by means of point- 
ers. In this figure, the insurance number and the age pointers 
are used to point to the numerical sequence of insurance num- 
bers and age, respectively. The updating procedure for the 
main file using this scheme is complicated because the pointers 
must be updated. In Figure 10, the new records are physically 
sequenced by hundreds of insurance numbers per track of the 
disc. After placing the new records in the main file, the 
records are posted to the appropriate indices. This posting 
operation is a complex one. It requires either that the index 
files (indices) be copied as records are being added at many 
points throughout the file [Figure 8] , or that chaining 
[Figure 9] or list structures [Figure 10] be used to permit 
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ADDITION FUNCTION OF MULTIPLE RECORD TYPE OF INDEX FILE 
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ADDITION FUNCTION OF CHAINED-TYPE OF INDEX FILE 




4-1 
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ADDITION FUNCTION OF LIST-TYPE OF INDEX FILE 




noncontiguous placement of records in the index files. The 
use of chaining within the index files is a complicated scheme 
because the succeeding record pointers have to be accessed and 
altered for every record that is added. The complexity of the 
process increases as the degree of file inversion increases. 

The main advantage attributed to list file, with regard 
to addition of records, is that the logical sequence of records 
can be continuously maintained. The copying of files, which is 
often necessary for multiple record files, is avoided. At 
most, only the index files have to be copied. For a simple 
list there is only one index. However, for inverted lists the 
number of indices that may need to be copied can be large. 

The process of record deletion with list files is also 
complex. If chaining is used within the main file or the index 
files, a complicated process of updating the pointers would be 
required. A further problem involves the accounting for spaces 
available to the records within the entire file. This would 
increase the processing time and grab spaces from the addition 
or deletion process during execution. 

The process of changing the record values would be a 
simple and fast process with list files. Pointers could be 
traced to the required records. The only factor that could 
cause prolonged search time for the records are pointers that 
span a large portion of the file. In such cases, numerous 
transfers of records between main and auxiliary memory would 
occur . 
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A record sequence change is the most difficult main- 
tenance function of list files. Many accesses to records and 
alterations of pointers may be necessary for relatively few 
changes in sequence. The last function of changing the file 
medium is a simple process; it involves copying the file from 
one medium to another. 

C. STORAGE REQUIREMENTS 

Another criterion is storage requirements. Auxiliary 
storage requirements are analyzed in this subsection. 

1 . Auxiliary Storage Requirements of Data Structures 
There are three basic methods of storing records on 
auxiliary devices. They are: sequential records, indexed 

records, and chained records. In sequential storage, records 
relationships are based on record adjacencies. In indexed 
record storage, indices are used to relate records to one 
another. In chained record storage, related records are chained 
together . 

Figure 11 depicts sequential record storage. Figure 
11(a) is a sequential storage of fixed length records. Note 
that 14 bytes of spaces are wasted. Figure 11(b) is a sequen- 
tial storage of variable length records. The total spaces 
required are 37 bytes, a saving of 8 bytes. The difference 
would be great as the number of records increases. The situa- 
tion could be the opposite though. This would occur if the 
records could be made of the same length. For example, if 
dates would be represented as numbers in records, say "010275" 
for January 2, 1975, "050475" for May 4, 1975, then it is better 
to use the fixed length records. 
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Record Number 



Length 



1 


2 


3 


4 


5 


LEE 


MACDONALD 


PETERSON 


BELL 


DAVIDSON 


i 9 1 


l — 9 


9 — < 


► 9 — i 


f— 9 — * 



(a) Sequential Storage of Fixed-Length Records 
(all records are 9 bytes long) 



Record Number 1 2 3 4 5 



3 


LEE 


1 — 

9 | MACDONALD 


8 


PETERSON 


4 


BELL 


8 


DAVIDSON 


'1 


'-3 - 


>1< 9 


-1- 


| g H 


l ~l 4 — 4 — 1 




: g , 



(b) Sequential Storage of Variable Length Records 



SEQUENTIAL STORAGE OF RECORDS 
Figure 11 



In Figure 12 are diagrams of indexed records storing 
schemes. Figure 12(a) is a single index record storage scheme. 
Figure 12(b) is a multiple index record storage scheme. Note 
that in Figures 12(a) and 12(b) the use of fixed length records 
would add to the space requirements. Compared to the sequential 
storage of variable length records, additional space is required 
for the indices. The additional space required would increase 
with the degree of file inversion. Possibly the index tables 
may exceed the size of data. This is the price one must pay 
for the capability to relate records using list files. 

Chained record storage requires more space than sequen- 
tial record storage. Added space is required for the record 
pointers. Compared to index records storage, this scheme re- 
quires no space for an index. Figure 13 depicts chained record 
storage for a single and multiple chains. In this figure the 
pointers are used to follow the numerical sequence of the 
attributes of social security number and age. 

D. ACCURACY OF DATA RETRIEVAL 

Accuracy can be defined as the degree of data retrieval 
correctness subject to some degree of effort exerted. This 
data retrieval accuracy concept can be subdivided into what 
is commonly known as the recall and precision ratio. The 
recall ratio is defined as the degree of success in retrieving 
relevant data from a relevant system. Recall ratio is the 
system's capability to let through the desired data. The pre- 
cision ratio is the system's capability to hold back unwanted 
data. Quantitatively, these ratios could be expressed as: 
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Index Table 



Record Structure 




1001 



1008 



1021 



1033 



1041 



(a) Single Index Record Storing (simple list) 

Record Structure 



Index Tables 
Name Index 



tame! SS# 



DATA 



Key 


Location 










LEE 


1001 


Data 


File 






MACDONALD 


1023 


1001 


LEE 


010232489 


DATA 


PETERSON 


1045 


1023 


MACDONALD 


123456789 


DATA 


BELL 


1067 


— 1045 


PETERSON 


543210987 


DATA 


DAVIDSON 


T089 


t/A 1067 


BELL 


090909090 


DATA 


~/7tA 1089 


DAVIDSON 


987654321 


DATA 



Social Security Number Index 



Key 


Location 


010232489 


1001 


090909090 


1067 


123456789 


1023 


543210987 


1045 


987654321 


1089 



(b) Multiple Index Record Storing (inverted list) 



INDEXED RECORD STORING 
Figure 12 
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Record Structure 



NAME 



AGE 



AGE POINTER 




LEE 


23 


1061 


MACDONALD 


26 


1031 


PETERSON 


29 


1999 


BELL 


21 


1001 


DAVIDSON 


25 


1016 



Age Pointer 



A 



(a) Single-Chain Records Storing 



Record Structure 



NAME 


ACE 


SS# 


AGE POINTER 


SS# POINTER 




Social 
Security No. 
Pointer 



LEE 


23 


010232489 


1113 


1085 


MACDONALD 


26 


123456789 


1059 


1059 


PETERSON 


29 


5432] 0987 


1999 


1113 


BELL 


21 


090909090 


1001 


1029 


DAVIDSON 


25 


987654321 


1029 


1888 



(b) Multi-Chain Records Storing 
CHAINED RECORD STORAGE 
Figure 13 
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Recall Ratio 



Number of Relevant Data Items Retrieved by the System ^ ^ ^ 

Total Number ol Relevant Data Items in the system x 

Precision Ratio = 

Number of Relevant Data Items R etrieved by the System -j^q 
T otal Number of Data Items Retrieved by the System x 

To fully understand these ratios, the following example is 
considered : 

Suppose there are 30 relevant records in an inquiry 
to a data base. A system search is conducted and 25 of these 
relevant records are retrieved. The recall ratio is 25/30, or 
about 83%. Assume that another 20 irrelevant records have been 
retrieved with the 25 relevant ones in the process. The total 
retrieved records are 45, hence the precision ratio is 25/45, 
or about 56%. It is usually stated that the search has a 83% 
recall at a precision of 56%. It has been observed that as the 
recall ratio increases the precision ratio decreases: A typical 

plot of recall ratio versus precision ratio is shown on the next 
page. [Lancaster and Fayen, 1973] . 

Data structure type is certainly a relevant factor in 
determining a system's data retrieval accuracy. Unfortunately, 
this is not the only factor that affects the system's accuracy. 
Size of data base and volatility of data are other factors. Due 
to the multiplicity of factors, it is difficult to determine the 
accuracy of retrieval. One may approximate the system's accuracy 
by simulation. Simulation should, however, be the last recourse. 
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Precision Ratio (I) 



100 r 



50- 




RECALL RATIO (%) VS. PRECISION RATIO (%) PLOT 

Figure 14 



100 
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E. TRANSIENCY 



Transiency of a data base is its tendency to be disorganized. 
The more transient a data base, the more time and processing 
would be required for its reorganization. Data structure type 
affects the system's transiency. Logically, transiency would 
increase with the degree of inversion. Transiency would also 
increase with the degree of chaining utilized. 
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IV. SUMMARY 



In data base system selection, one may hope for but should 
not fully expect a perfect system. The circumstances impacting 
on the evaluator or designer would considerably affect the 
selection decision, e.g., financial and political considera- 
tions. Another factor is the possible contradiction in users' 
requirements. A case in point is a requirement for a data base 
system with a powerful search capability with minimum storage 
space. Finally, technological progress is a factor. As one 
completes a large data base system design, developments may 
occur which could be utilized in the system design. 

Various topics regarding data base systems have been dis- 
cussed in this paper. They involve current practices in this 
field. Possible tradeoff considerations have been emphasized 
where they are important. The central topic in this paper is 
data structures. This subject area has been emphasized because 
it is important with respect to the selection of storage struc- 
tures, file media and file maintenance functions. This implies 
that one can judiciously select data base systems by placing 
great emphasis on data structure considerations. 

Four categories of data base criteria have been analyzed: 
retrieval capability, maintenance capability, storage require- 
ments and accuracy of data retrieval. It is by no means 
claimed that these criteria are complete. Using these criteria, 
however, could provide some assurance of the correctness of the 
data base evaluation. 
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